Title Similarity-Based Feature Weighting for Text Categorization

Authors

Shane Bergsma

Dekang Lin

Abstract

In automated text categorization, a system analyzes a natural-language document to decide whether it belongs in one or more of a set of pre-defined categories. The typical approach is to represent the documents as feature vectors and to inductively generate a classifier from a training set of documents and their manually assigned categories. Such a process ignores word order, syntax, and other cues that might help identify good features for categorization. Recently, more attention has been paid to using deeper natural language processing techniques to improve the performance of the standard classifiers. One such approach, which takes advantage of a previously generated thesaurus of lexical similarities, is studied in this project. The system identifies keywords in the text by looking for terms with high similarity to the terms in the title field. A database of automatically clustered, dependency-based word similarities is used to identify the similar words. Experiments show that increasing the weight of key terms improves the effectiveness of text categorization for a number of topics in the standard Reuters newswire corpus.
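The weighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `SIMILARITY` table, its scores, the `threshold` and `boost` parameters, and the function names are all assumptions made for illustration, and a small hand-written dictionary stands in for the dependency-based similarity database.

```python
from collections import Counter

# Hypothetical stand-in for the thesaurus of lexical similarities;
# the word pairs and scores below are invented for illustration only.
SIMILARITY = {
    ("wheat", "grain"): 0.6,
    ("export", "shipment"): 0.5,
}

def similarity(a, b):
    """Symmetric lookup; identical words count as fully similar."""
    if a == b:
        return 1.0
    return SIMILARITY.get((a, b), SIMILARITY.get((b, a), 0.0))

def weighted_features(title, body, threshold=0.4, boost=2.0):
    """Build a term-frequency vector in which body terms that are
    similar to any title term have their counts multiplied by `boost`."""
    title_terms = set(title.lower().split())
    counts = Counter(body.lower().split())
    features = {}
    for term, count in counts.items():
        # Best similarity between this body term and any title term.
        best = max((similarity(term, t) for t in title_terms), default=0.0)
        features[term] = count * boost if best >= threshold else float(count)
    return features

vec = weighted_features(
    "Wheat export rises",
    "grain shipment up as wheat demand grows",
)
```

In this toy run, "grain", "shipment", and "wheat" are boosted because each is similar to a title term, while "demand" keeps its raw count. The resulting vector could then be passed to any standard classifier in place of plain term frequencies.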

Similar resources

Feature Weighting Improvement of Web Text Categorization Based on Particle Swarm Optimization Algorithm

Structures such as the title often express the main content of a text, and these structures can influence the effectiveness of text categorization. However, the most common feature weighting algorithm, term frequency-inverse document frequency (TF-IDF), does not take the structural information of texts into account. To solve this problem, a new feature weighting al...

Enriched Format Text Categorization Using A Component Similarity Approach

Text categorization has been widely studied for years. However, conventional categorization approaches that work well on plain text perform poorly when they are simply applied to enriched format texts. A categorization approach that is applicable to enriched format text is proposed. During feature selection, we get feature structure distribution weight by using extended structure mode...

Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA

With the explosive growth in the amount of information, tools and methods for searching, filtering, and managing resources are in high demand. One of the major problems in text classification relates to high-dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of the feature space. There are many feature selection methods. However...

The Analysis and Optimization of KNN Algorithm Space-Time Efficiency for Chinese Text Categorization

The performance of any text classification algorithm is reflected in the reliability of its classification results and in the efficiency of the algorithm. We analyze the space-time efficiency of the different stages of the traditional KNN algorithm for Chinese text classification while ensuring the reliability of classification. We then optimize the efficiency of the algorithm and the...

Compherensive Review Of Text Classification Using Machine Learning

Text classification, also known as text categorization, is the task of automatically allocating unlabeled documents into predefined categories. Text classification means allocating a document to one or more categories or classes. The ability to accurately perform a classification task depends on the representations of the documents to be classified. Text representations transform the textual docum...

Publication date: 2004
